Fine-tuning Vectorization and Memory Traffic on Intel Xeon Phi Coprocessors: LU Decomposition of Small Matrices

Author

  • Andrey Vladimirov
Abstract

Common techniques for fine-tuning the performance of automatically vectorized loops in applications for Intel Xeon Phi coprocessors are discussed. These techniques include strength reduction, regularizing the vectorization pattern, data alignment and aligned-data hints, and pointer disambiguation. In addition, loop tiling is shown as a technique for tuning memory traffic. The optimization methods are illustrated on an example of single-threaded LU decomposition of a single-precision matrix of size 128×128. Benchmarks show that the discussed optimizations improve the application performance on the coprocessor by a factor of 2.8 compared to the unoptimized code, and by a factor of 1.7 on the multi-core host system, achieving roughly the same performance on the host and on the coprocessor. The code discussed in the paper can be freely downloaded from the Colfax Research Web site.
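To make some of the techniques named in the abstract concrete, here is a minimal sketch in C. It is not the paper's code. It assumes a C11 compiler (for `aligned_alloc` and `restrict`), a row-major matrix, and Doolittle LU without pivoting; the names `lu_alloc`, `lu_decompose`, and `LU_ALIGN` are hypothetical. It shows strength reduction (one reciprocal per pivot instead of a division per row), 64-byte data alignment, and pointer disambiguation with `restrict`. It does not attempt the regularized vectorization pattern or loop tiling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed alignment in bytes: the width of one 512-bit vector register. */
#define LU_ALIGN 64

/* Allocate an n x n float matrix on a 64-byte boundary; release with free().
 * Rows are individually aligned only when n * sizeof(float) is a multiple
 * of LU_ALIGN, which holds for n = 128. */
float *lu_alloc(size_t n) {
    size_t bytes = n * n * sizeof(float);
    /* aligned_alloc expects the size to be a multiple of the alignment. */
    bytes = (bytes + LU_ALIGN - 1) / LU_ALIGN * LU_ALIGN;
    return aligned_alloc(LU_ALIGN, bytes);
}

/* In-place Doolittle LU decomposition without pivoting.
 * On return, the strict lower triangle of a holds L (unit diagonal implied)
 * and the upper triangle, including the diagonal, holds U. */
void lu_decompose(float *restrict a, size_t n) {
    assert(a != NULL && n > 0);
    for (size_t k = 0; k < n; k++) {
        /* The pivot row is only read inside this iteration. */
        const float *restrict pivot_row = a + k * n;
        /* Strength reduction: one division here, multiplications below. */
        const float inv_pivot = 1.0f / pivot_row[k];
        for (size_t i = k + 1; i < n; i++) {
            /* restrict tells the compiler that row and pivot_row do not
             * overlap, so the inner loop can be vectorized safely. */
            float *restrict row = a + i * n;
            const float l = row[k] * inv_pivot;
            row[k] = l;
            for (size_t j = k + 1; j < n; j++)
                row[j] -= l * pivot_row[j];
        }
    }
}
```

Because the sketch omits pivoting, it is only suitable for matrices that do not need row exchanges, such as diagonally dominant ones. The inner loop starts at `j = k + 1`, so its first element is generally not vector-aligned even when every row is.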


Similar resources

Many Core Acceleration of the Boundary Element Method

The Intel Xeon Phi coprocessors provide an efficient tool for the acceleration of scientific codes. In contrast to GPGPU programming, where the code has to be adapted to the hardware design of the graphics cards, Intel's MIC (Many Integrated Core) technology allows for easy porting of standard CPU code. One of the options to utilize the Xeon Phi coprocessor is to run the code on the CP...


Cluster-level tuning of a shallow water equation solver on the Intel MIC architecture

The paper demonstrates the optimization of the execution environment of a hybrid OpenMP+MPI computational fluid dynamics code (shallow water equation solver) on a cluster enabled with Intel Xeon Phi coprocessors. The discussion includes: 1. Controlling the number and affinity of OpenMP threads to optimize access to memory bandwidth; 2. Tuning the inter-operation of OpenMP and MPI to partition t...


Matrix factorization routines on heterogeneous architectures

In this work we consider a method for parallelizing matrix factorization algorithms on systems with Intel® Xeon Phi™ coprocessors. We provide performance results of matrix factorization routines implementing this approach and available in Intel® Math Kernel Library (Intel MKL) on the Intel® Xeon® processor line with Intel Xeon Phi coprocessors. Summary: New heterogeneous systems consisting...


Evaluating kernels on Xeon Phi to accelerate Gysela application

This work describes the challenges presented by porting parts of the Gysela code to the Intel Xeon Phi coprocessor, as well as techniques used for optimization, vectorization and tuning that can be applied to other applications. We evaluate the performance of some generic micro-benchmarks on Phi versus Intel Sandy Bridge. Several interpolation kernels useful for the Gysela application are analyz...


Acceleration of the Boundary Element Library BEM4I on the Knights Corner and Knights Landing Architectures

The aim of the poster is to present the acceleration of the boundary element method (BEM) by the Intel Xeon Phi technology. The poster provides a brief overview of BEM, followed by the discretization approach and efficient numerical assembly of the BEM matrices. We discuss its parallelization by OpenMP in shared memory and the SIMD vectorization necessary to exploit the full potential of the Xeon ...




Publication date: 2015